Today we will focus on making progress on the data for your final projects. By today (ideally), or Sunday at 5pm at the latest, you will have a tidy data set that contains all of the variables you need to proceed with the analysis for your final project. You will upload the compiled HTML file from this .Rmd, which will count toward both your lab grade and your Assignment 3 grade.
Pick one person to be the designated coder/writer for this assignment. Do all your work on their computer.
Grades for this assignment are designed to show where your group stands:
100: On track, nothing left to do
90: Some minor changes, additional work to do
80: Major changes, errors, lots of work to do to stay on track.
Your document should contain the following code and content:
Code to:
Set up your workspace
Load your data (or datasets) into R
Recode, clean, merge, and transform your data as needed.
Produce a table of summary statistics
Produce at least one (and possibly more) informative descriptive figures
Content, including:
An overview of your data
What is the source of your data set(s)
What is the unit of observation?
How many observations?
A codebook identifying your variables
A brief substantive description of your descriptive statistics: what does a typical observation in your data set look like?
A list of next steps and/or outstanding questions or goals. For example:
Clarifying theoretical framework and expectations
Specifying linear models to test research question
Fitting and interpreting linear models
Gathering additional data
Producing particular figures (maps, faceted plots)
Below, I’ve integrated these tasks into what I think is a reasonable workflow – so you’ll be alternating between code and content as you progress with this assignment.
Use code from previous labs and class. At a minimum, you’ll want to load the packages of the tidyverse and maybe something like haven for reading data.
# Set up workspace
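A minimal sketch of a setup chunk, assuming the tidyverse and haven packages are already installed on the coder's computer (if not, run install.packages() once in the console first):

```r
# Load the packages used throughout this assignment.
# Assumes both are already installed.
library(tidyverse)  # attaches dplyr, tidyr, ggplot2, readr, and others
library(haven)      # read_dta(), read_sav() for Stata and SPSS files
```

Add any other packages your group needs to this same chunk so that everything is loaded in one place.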
This will vary for each group. I believe most of you are loading data directly from the web. If you’re loading data stored locally, you’ll need to write code to set the working directory to where your data and .Rmd file are saved so that R can find the data.
# Load data
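Here is a rough sketch of loading a CSV and taking a first look at it. To keep the example self-contained, it writes a tiny made-up file to a temporary location and reads that back in; `tmp_path` and the column names are placeholders, and in your own chunk you would pass the URL or file path of your real data instead.

```r
library(readr)
library(dplyr)

# Stand-in for a real data source: a tiny, made-up CSV in a temporary file.
tmp_path <- tempfile(fileext = ".csv")
writeLines(c("state,turnout", "A,0.5", "B,0.6", "C,NA"), tmp_path)

# read_csv() accepts a local path or a URL.
# For Stata or SPSS files, haven::read_dta() or haven::read_sav() play the same role.
df <- read_csv(tmp_path, show_col_types = FALSE)

# First checks: dimensions, variable types, and a quick summary
dim(df)
glimpse(df)
summary(df$turnout)
```

Checking dim() right after loading gives you the number of observations you will report in your data overview.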
Once you’ve loaded your data, create a small codebook outlining the following:
The value/range should describe the values each variable can take once you’ve recoded it. You may need to look at the data using commands like table() and summary() to clarify this.
Use the mutate() command in combination with
case_when() to recode categorical variables using logical indexing.
ifelse() to recode binary variables. Also useful for recoding values that should be NA.
Remember to save the output of your recoding back into the data set
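The recoding steps above might look something like the following sketch. The data and variable names (`age`, `party_id`, the -99 missing code) are invented for illustration, so adapt them to your own data.

```r
library(dplyr)

# Hypothetical toy data; names and codes are placeholders
df <- tibble(
  age      = c(23, 45, 67, -99, 31),
  party_id = c(1, 2, 3, 2, 9)
)

# Assign the result back to df so the recodes are saved
df <- df %>%
  mutate(
    # ifelse(): turn a sentinel value into NA
    age = ifelse(age == -99, NA_real_, age),
    # ifelse(): create a binary indicator (NA stays NA)
    over_40 = ifelse(age > 40, 1, 0),
    # case_when(): recode a categorical variable; unmatched codes become NA
    party = case_when(
      party_id == 1 ~ "Democrat",
      party_id == 2 ~ "Independent",
      party_id == 3 ~ "Republican",
      TRUE ~ NA_character_
    )
  )

# Check the recodes against the original variables
table(df$party, df$party_id, useNA = "ifany")
summary(df$age)
```

Cross-tabulating the new variable against the old one is a quick way to confirm that every original code ended up where you intended.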
If you’re working with multiple datasets, you’ll need to merge them together using left_join()
Use the by=c("var1" = "var2") argument to merge dataset 1 with dataset 2 using var1 in dataset 1 and var2 in dataset 2.
Make sure that the values in the variables you merge by match up. If you’re merging together state level data, make sure that both datasets spell each state name exactly the same way (e.g. you don’t want one data set to have “D.C.” and another to have “District of Columbia”)
Save the output into a temporary data frame. Check the dimensions. The number of rows should equal the number of rows in your main (final) data set. The columns will include the additional unique variables. Merge in additional datasets, each time creating a temporary data frame, to check the results of your merge. When you’re satisfied, save this data frame into a new object that will be the data frame you use for your analysis.
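As a sketch of this merge workflow, the example below joins two made-up state-level data frames. All names and values (`main_df`, `covars`, `turnout`, `median_income`) are hypothetical. It also uses anti_join(), which is not required by the assignment but is a convenient way to list rows in the main data that have no match in the second data set.

```r
library(dplyr)

# Hypothetical toy data sets; all values are made up
main_df <- tibble(
  state   = c("Alabama", "Alaska", "District of Columbia"),
  turnout = c(0.1, 0.2, 0.3)
)
covars <- tibble(
  state_name    = c("Alabama", "Alaska", "D.C."),
  median_income = c(1, 2, 3)
)

# Make the merge keys match exactly before joining
covars <- covars %>%
  mutate(state_name = ifelse(state_name == "D.C.",
                             "District of Columbia", state_name))

# Rows of main_df with no match in covars (should be empty)
unmatched <- anti_join(main_df, covars, by = c("state" = "state_name"))
unmatched

# Merge into a temporary data frame and check the dimensions
tmp <- left_join(main_df, covars, by = c("state" = "state_name"))
dim(main_df)
dim(tmp)

# Same number of rows as the main data set? Then keep the result.
stopifnot(nrow(tmp) == nrow(main_df))
analysis_df <- tmp
```

If the row count grows after a left_join(), the second data set probably has duplicate values of the merge key.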
# Recode
Once you’ve recoded your data:
Create a table of summary statistics for your outcome, key predictor(s), and covariates
Present at least one descriptive figure that illustrates the distribution of your outcome or key predictor, or shows an interesting relationship between variables.
Interpret your results
Producing a table of summary statistics requires a little foresight.
Essentially you want to make a data frame where each row is a (numeric) variable, and each column is a statistic (minimum, 25th percentile, median, mean, 75th percentile, maximum, number of missing values).
To do this, I would:
Create an object called the_vars which contains the names (in quotation marks) of the variables you want to summarize.
Select these variables from your data set using df %>% select(all_of(the_vars))
Use %>% pivot_longer(), specifying cols = all_of(the_vars), names_to = "Variable", and values_to = "value", to transform this wide dataset into a long dataset
Then use %>% group_by(Variable) %>% summarise() to calculate the statistics for each variable of interest (e.g. %>% summarise(Mean = mean(value, na.rm = TRUE)))
Save the output to an object called something like sum_df
In a new chunk, use knitr::kable(sum_df) %>% kableExtra::kable_styling() to format your table. Set echo=FALSE in the chunk header.
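Put together, the steps above could be sketched as follows. The data frame and the variables `x` and `y` are made up for illustration. Note that the reshaping step uses pivot_longer(), since the goal is to go from a wide data set to a long one.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; replace with your own data frame
df <- tibble(
  x     = c(1, 2, 3, 4, NA),
  y     = c(10, 20, 30, 40, 50),
  group = c("a", "b", "c", "d", "e")
)

# Names of the numeric variables to summarize
the_vars <- c("x", "y")

sum_df <- df %>%
  select(all_of(the_vars)) %>%
  # Wide to long: one row per variable-value pair
  pivot_longer(cols = all_of(the_vars),
               names_to = "Variable", values_to = "value") %>%
  group_by(Variable) %>%
  summarise(
    Min     = min(value, na.rm = TRUE),
    P25     = quantile(value, 0.25, na.rm = TRUE, names = FALSE),
    Median  = median(value, na.rm = TRUE),
    Mean    = mean(value, na.rm = TRUE),
    P75     = quantile(value, 0.75, na.rm = TRUE, names = FALSE),
    Max     = max(value, na.rm = TRUE),
    Missing = sum(is.na(value))
  )

sum_df

# In a separate chunk with echo=FALSE, format the table, for example:
# knitr::kable(sum_df, digits = 2) %>% kableExtra::kable_styling()
```

Each row of sum_df is one variable and each column is one statistic, which is the layout described above.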
# Summarise data
To create a figure, you’ll need to specify the following:
data (e.g. df %>%)
aesthetic mappings (e.g. ggplot(aes(x = predictor, y = outcome)))
geometries
Univariate: geom_density(), geom_boxplot(), geom_histogram()
Bivariate: geom_point() for a scatterplot, geom_line() for a trend
Once you have a minimal working example, experiment with other layers from the grammar of graphics:
labs() for custom labels
theme_XXX for custom themes
facet_wrap(~group) to produce the same plot faceted by some categorical grouping variable
When you’re happy with your figure, save it as an object in R (e.g. fig1 <- df %>% ggplot(aes(predictor, outcome))+geom_point()). Put that object in its own chunk to display it in your document.
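For instance, a minimal scatterplot built up with labels, a theme, and facets might look like the sketch below. The data are simulated purely for illustration, and `predictor`, `outcome`, and `group` are placeholder names.

```r
library(dplyr)
library(ggplot2)

# Simulated toy data, for illustration only
set.seed(123)
df <- tibble(
  predictor = rnorm(100),
  outcome   = 2 * predictor + rnorm(100),
  group     = rep(c("A", "B"), times = 50)
)

# Data, aesthetic mappings, and a geometry, plus optional layers
fig1 <- df %>%
  ggplot(aes(x = predictor, y = outcome)) +
  geom_point() +
  labs(
    title = "Outcome by predictor",
    x = "Predictor",
    y = "Outcome"
  ) +
  theme_minimal() +
  facet_wrap(~group)

# In its own chunk, print the object to display the figure
fig1
```

Start with just the ggplot() and geom_point() lines, confirm the plot renders, and then add the remaining layers one at a time.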
Don’t let the perfect be the enemy of the good.
# Descriptive figures
Please provide an overview of the data (source, number of observations, unit of analysis).
Describe a typical observation, making reference to the statistics in your summary table.
Offer a substantive interpretation of your descriptive figure(s). What do they tell us about the distribution of a key variable, or the relationship between two variables?
Use this section to outline next steps for your group and assign tasks and responsibilities. If you have any specific questions, requests, or things I can help with, please let me know.